Abstract. - In this paper we investigate the nature and structure of the relation between imposed
classifications and real clustering in a particular case of a scale-free network given by the on-line
encyclopedia Wikipedia. We find a statistical similarity in the distributions of community sizes
both in the top-down approach of the division into categories present in the archive and in
the bottom-up procedure of community detection given by an algorithm based on the spectral
properties of the graph. Regardless of the statistically similar behaviour, the two methods provide a
rather different division of the articles, thereby signaling that the nature and presence of power
laws is a general feature for these systems and cannot be used as a benchmark to evaluate the
suitability of a clustering method.



Many real systems can be modeled by means of a scale-free
network [1,2]. Through such a mathematical representation
it is often possible to better understand the development
of these systems and possibly to discover some unexpected
behaviour. Much scientific interest has recently focussed
on their community structure, often revealed by highly
clustered regions of a network. Dividing a network into
communities of nodes sharing some given property gives
a coarse-grained representation of the whole system. The
paramount example is given by information networks
such as the World Wide Web (WWW). The WWW
is a network composed of HTML documents connected by
hyperlinks and, because of its giant size, only partial studies
of its community structure have been produced [3,4].
Once a large information network such as the WWW is
decomposed into communities, data mining can be performed
more efficiently by restricting the data
search to smaller regions of the WWW where the desired
information is more likely to be found.

Another well-known example of an information network
is the on-line, user-generated encyclopedia Wikipedia,
available at http://www.wikipedia.org in several languages.
The articles of each encyclopedia can be represented
as nodes, and the hyperlinks from one article to another
within the Wikipedia form directed networks shaped by
the article creations and edits of thousands of individual
users around the world. The versions of Wikipedia
we explored display statistical properties [5-7] typical of
complex networks such as the WWW, of which Wikipedia is
a subset, even though their microscopic growth processes
differ noticeably: while on the WWW users need
"administrator" rights to edit webpages, Wikipedia articles
can be edited by any user.

Wikipedia networks have varying sizes, depending on the
language and activity of the underlying users' community,
ranging from a few hundred to more than one million
articles. Here we present an analysis of a sample of this set
of graphs (hereafter Wikigraphs) collected in September of 
2007 from the web site http://download.wikimedia.org/. 
In particular, we study similarities and differences between 
two possible classifications of Wikipedia articles: their in- 
ternal categorization and the partition, by a suitable al- 
gorithm, of the network formed by articles and hyperlinks 
between them. 

Wikipedia articles are gathered into categories according
to their topics. The classification of articles and the
creation or deletion of categories are decided by
agreement of the whole Wikipedia community. In turn,
categories are organised hierarchically according to their
generality. However, articles and categories do not strictly
form a perfect tree, since an article or a category may
be the child of more than one parent category.
Therefore, the taxonomy of articles can be represented as a
directed acyclic graph [8].
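As a minimal illustration of this data structure, the sketch below stores a toy taxonomy as a map from each node to its set of parent categories and checks that it is acyclic even though it is not a tree. The category and article names are hypothetical and not taken from the Wikipedia dumps; the check relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical toy taxonomy: each key maps to the set of its parent categories.
parents = {
    "Science": set(),
    "Physics": {"Science"},
    "Mathematics": {"Science"},
    "Mathematical physics": {"Physics", "Mathematics"},  # category with two parents
    "Ising model": {"Mathematical physics", "Physics"},  # article with two parents
}


def is_dag(parent_map):
    """Return True if the parent-child relation contains no cycle."""
    try:
        # static_order() raises CycleError as soon as a cycle is detected.
        list(TopologicalSorter(parent_map).static_order())
        return True
    except CycleError:
        return False


# Nodes with more than one parent are what prevents the taxonomy from being a tree.
multi_parent = sorted(node for node, p in parents.items() if len(p) > 1)
```

Here `is_dag(parents)` holds although two nodes have more than one parent, which is the situation described in the text.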

Much work has been devoted to the study of the statistical
features of taxonomies, in order to understand whether
their overall properties could reveal any general pattern
of organization. In most cases, one observes power-law
distributions in the number of offspring, which can be
explained by means of Yule processes or by the inherent
properties of supercritical trees. The first explanation has
been proposed for taxonomies of natural species [9], which
show a power-law decay in the frequency of the number of
species for a given genus. Based on such data, Yule [10]
introduced a model to explain how mutations in a population
of individuals may eventually form a series of different
species in the same genus. The results of this process
are in rather good agreement with the observed data.
Yule processes represent a fundamental mechanism in the
production of power laws, though they do not completely
reproduce the richness of the scale invariance present in
natural taxonomies. Indeed, when looking at the statistical
distribution of the sizes of trees (which corresponds to
the distribution of genera in the same family and different
families in the same order) a similar power-law relation
has also been found [11,12]. In this case, the value of
the power-law exponent may also depend upon the observed
ecosystem type [13]. Following this experimental
evidence, one may decide to model the development of the
whole hierarchical tree of the taxonomy by using a random
branching process. It has been analytically shown that the
subtree size distribution of a random tree displays a power-law
decay, $P(s) \propto s^{-\tau}$. The exponent $\tau$ is 3/2 for critical
random trees, where the branching number is 1 [14], and
equal to 2 [15] if the branching number is larger than 1,
as is often the case in real instances of growing
networks. Therefore, the presence of power laws with an
exponent close to 2 can be considered just a consequence
of the parent-child structure of a taxonomy [16].

Besides being classified into categories, Wikipedia articles
may also be clustered by analysing the network
that connects them through hyperlinks. This task
is nowadays performed by a number of algorithms [17].
Methods based on edge betweenness and clustering coefficient
assume that edges lying on most of the shortest paths
in the graph, or with low clustering coefficient, are likely
to connect separate communities. By recursively deleting
the edges with the largest betweenness or low clustering, the
graph splits into its communities [18,19]. Methods that
optimise the network modularity, instead, form clusters of
nodes so that the density of links within the communities
is maximised relative to the number of links among
communities [20,21]. Finally, spectral methods are based on
the analysis of the eigenvalues and eigenvectors of suitably
chosen functions of the adjacency matrix A, whose size
is given by the number of vertices n in the graph and whose
elements $A_{ij}$ are equal to 1 if an edge exists between nodes
i and j and zero otherwise [20,22,23].
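To make the spectral idea concrete, the following sketch builds the adjacency matrix of a small undirected toy graph and splits it in two using the sign of the eigenvector associated with the second-smallest eigenvalue of the graph Laplacian (the Fiedler vector). This is only one common choice of "function of the adjacency matrix" and is not necessarily the one used in the works cited above. The toy graph, the function names and the undirected treatment of links are assumptions made for illustration.

```python
import numpy as np


def adjacency_matrix(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph with n nodes."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A


def spectral_bisection(A):
    """Split the nodes in two groups by the sign of the Fiedler vector."""
    degrees = A.sum(axis=1)
    laplacian = np.diag(degrees) - A
    # eigh returns the eigenvalues of a symmetric matrix in ascending order.
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]
    return fiedler >= 0


# Toy graph (assumption): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = adjacency_matrix(6, edges)
side = spectral_bisection(A)  # boolean label per node
```

The overall sign of an eigenvector is arbitrary, so only the grouping of the nodes is meaningful, not which group is labelled `True`.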

While any of these methods can be applied to small
graphs, they unfortunately turn out to be unusable on larger
networks, since they require excessive computational
resources or time. The detection of strongly interconnected
communities of nodes in a network can nevertheless be
achieved by finding the attraction basins of random walks
on the graph. This is done by the method we
adopted in our investigation, the MCL algorithm, which
responds in a reasonable time even for networks
including thousands of nodes, and can be suitably tuned
to maintain its efficiency for even larger
systems. It should be noted, however, that the MCL
algorithm too is unable to cluster the larger available
Wikigraphs.

The MCL algorithm [25,26] finds the partition of a net- 
work at the desired resolution as follows: 

1. start with the adjacency matrix A of the network
and normalise each column of the matrix to obtain
a stochastic matrix S;

2. compute $S^2$;

3. take the p-th power (p > 1) of every element of $S^2$ and
normalise each column to one;

4. go back to step 2.
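The four steps above can be sketched in a few lines of NumPy. This is a minimal dense-matrix illustration, not the implementation used for the Wikigraphs. The function name `mcl`, the convergence tolerance, the iteration cap and the threshold used to read off the clusters are assumptions. Adding self-loops before normalising is a common practice with MCL that is not stated in the text and is likewise an assumption here.

```python
import numpy as np


def mcl(adjacency, p=2.0, max_iter=100, tol=1e-9):
    """Cluster a graph by alternating squaring and element-wise powers."""
    A = np.asarray(adjacency, dtype=float)
    A = A + np.eye(A.shape[0])      # assumption: add self-loops
    S = A / A.sum(axis=0)           # step 1: column-stochastic matrix
    for _ in range(max_iter):
        previous = S
        S = S @ S                   # step 2: two-step walk probabilities
        S = S ** p                  # step 3: element-wise p-th power ...
        S = S / S.sum(axis=0)       # ... followed by column normalisation
        if np.allclose(S, previous, rtol=0.0, atol=tol):
            break                   # matrix is invariant under steps 2 and 3
    # Each row that keeps non-zero entries collects the nodes of one cluster.
    clusters = {
        frozenset(np.flatnonzero(row > 1e-6).tolist())
        for row in S
        if row.max() > 1e-6
    }
    return sorted(sorted(c) for c in clusters)


# Toy graph (assumption): two triangles joined by the single edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
clusters = mcl(A, p=2.0)
```

On this toy graph the two triangles are recovered as separate clusters. A dense matrix product costs of the order of n^3 operations per iteration, so a sketch of this kind is only meant for small examples.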

The physical meaning of this procedure is the following:
through step 2 we compute the probability that a random
walk reaches a node two steps away from its starting position. If a
walk starts within a community, it is more likely
to remain inside it. By raising these probabilities to
a power (step 3) and then normalising them, we enhance
these paths with respect to the others. The effect is to
create a stochastic matrix S' corresponding to an adjacency
matrix (and hence a graph) in which edges between
communities are removed.
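The two operations described above can be written compactly as follows. The symbol $\Gamma_p$ for the combination of element-wise power and column normalisation is a notational assumption, and the numbers are a hand-made example showing how step 3 sharpens a column of probabilities for p = 2.

```latex
% Step 2 (squaring) and step 3 (element-wise power and column normalisation)
S \;\longrightarrow\; S^{2},
\qquad
\left(\Gamma_p M\right)_{ij} \;=\; \frac{M_{ij}^{\,p}}{\sum_{k} M_{kj}^{\,p}} .

% Worked example for one column with p = 2:
% (0.6,\ 0.3,\ 0.1) \;\to\; (0.36,\ 0.09,\ 0.01)/0.46 \;\approx\; (0.783,\ 0.196,\ 0.022)
```

The largest entry grows from 0.6 to about 0.78 while the smallest shrinks from 0.1 to about 0.02, which is the reinforcement of already probable walks described in the text.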

After some iterations, MCL converges to a matrix
$S_{\mathrm{MCL}}(p)$ which is invariant under steps 2 and 3.
Only a few rows of $S_{\mathrm{MCL}}(p)$ have non-zero entries,
yielding the nodes' clusters as separated basins (there is in
general exactly one non-zero entry per column). As noted
above, step 3 reinforces the high-probability walks at
short time scales at the expense of the low-probability ones.
On physical grounds, the whole iterative process of the
MCL algorithm corresponds to simulating many random
walks on the network, strengthening their flow where
it is already strong and weakening it where it is weak. The
parameter p tunes the granularity of the clustering. If p is
large, the effect of step 3 becomes stronger and the random
walks are likely to end up in a greater number of smaller
basins of attraction, or communities. On the other hand,
a small p produces larger communities. In the limit
p = 1, only one cluster is found. The MCL method thus
has a parameter to be tuned, which determines the resolution
of the resulting division of the network. In order to compare
the communities emerging from the MCL analysis
with the taxonomy established by Wikipedia contributors,
we set p so as to produce approximately as many communities
as there are categories in the data.

In our work, we have investigated the relation between
the category structure of Wikipedia and the clustering
properties of the underlying graph representing articles
and the hyperlinks between them. The category system is
a tool to let users browse the content of Wikipedia with
greater ease. Due to the large amount of information to be
handled by users, a self-organised categorization system
would help in classifying pages without human intervention
and discussion. Clustering methods, often based on
the topology of a graph, are used to this end, especially to
deal with user-generated content on the WWW [24,27,28].
However, the application of automatic clustering methods
in each specific context has to be validated by comparing
their output with the results of manual indexing. The
aim of this paper, thus, is to compare and discuss the
partition of the graph based on the built-in taxonomy,
i.e. the categories, and the one obtained by means of the
MCL algorithm applied to the network of Wikipedia
pages connected by internal links. The various data sets
analyzed can be downloaded from the archive of the
encyclopedia (http://download.wikimedia.org). We selected
a number of data dumps according to their number of
articles S. In particular, the largest set we considered is
the English Wikipedia (En, 2,042,361 articles), while the
smallest one is the Norman one (Nrm, 2,750 articles), as
reported in table 1. They span several orders of magnitude
in the number of nodes, ranging at that time from
the few thousand articles of the Norman archive to the
millions of articles of the English one. This way, we are able
to check whether finite-size effects affect our observations.
For each Wikipedia, we analyzed the datasets reporting
the category structure and the internal link structure.



Language          Wikipedia   Articles

English           En          2042361
German            De           650241
Italian           It           357538
Norwegian         No           134943
Catalan           Ca            81660
Danish            Da            70757
Croatian          Hr            35932
Galician          Gl            28113
Simple English    Simple        19921
Latin             La            15602
Neapolitan        Nap           12603
Occitan           Oc            10359
Afrikaans         Af             8443
Aragonese         An             7144
Venetian          Vec            5974
Corsican          Co             5324
Interlingua       Ia             3652
Alemannic         Als            3141
Norman            Nrm            2750

Table 1: The Wikipedia versions sampled in the analysis and
their size.

We start by considering the category and cluster size
distributions, i.e. the distributions of the number of
Wikipedia pages contained in a category or in a cluster. The
statistical properties of Wikipedia taxonomies
display a remarkable regularity in the size distribution of
categories. As shown in Fig. 1, the category size distribution
$P(s)$ is heavy-tailed, approximately following a power
law $P(s) \propto s^{-\gamma}$ with $\gamma \simeq 2.2$ for very different sizes of the
system.

Fig. 1: The frequency of category sizes in a sample of the
Wikigraphs analyzed here. The solid line represents $s^{-2.2}$.

We then applied the MCL algorithm to measure the
size distribution of topology-based communities. Unfortunately,
our survey has to limit itself to Wikipedias smaller
than a given size, above which the problem of clustering
the network becomes computationally intractable. Interestingly,
the clustering-based partition we obtain follows a
very similar cluster size distribution, with a power-law decay
for large values of s, as can be observed in figure 2.
In the experiment, we tuned the granularity parameter
of the MCL algorithm in order to obtain approximately
the same number of communities and categories.
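As an illustration of how a power-law exponent of this kind can be estimated from a list of category or cluster sizes, the sketch below uses the standard continuous maximum-likelihood estimator $\hat\gamma = 1 + n / \sum_i \ln(s_i/s_{\min})$. This is not necessarily the fitting procedure behind the figures, and since sizes are integers the continuous formula is only an approximation. The function name, the cut-off `s_min` and the synthetic sample are assumptions.

```python
import math


def powerlaw_exponent_mle(sizes, s_min=1.0):
    """Continuous maximum-likelihood estimate of gamma in P(s) ~ s**(-gamma)."""
    tail = [s for s in sizes if s >= s_min]
    # Fails with a ZeroDivisionError if every size equals s_min.
    log_sum = sum(math.log(s / s_min) for s in tail)
    return 1.0 + len(tail) / log_sum


# Synthetic check (assumption): deterministic quantiles of a continuous power
# law with exponent 2.2, obtained by inverting its cumulative distribution.
gamma_true = 2.2
n = 10_000
sample = [
    (1.0 - (i + 0.5) / n) ** (-1.0 / (gamma_true - 1.0))
    for i in range(n)
]
gamma_hat = powerlaw_exponent_mle(sample)
```

On the synthetic sample the estimate falls very close to the exponent used to generate it. On real data the result depends on the choice of `s_min`, which should exclude the small-size region where the power law does not hold.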

Nonetheless, Fig. 2 shows only that the two partitions have a
similar structure, which does not necessarily imply that the partitions
themselves are similar. To compare the two partitions,
we adopt as a measure the adjusted Rand index, as it has
been recently generalised to soft partitions [29]. The standard
Rand index [30] results from a pairwise comparison of the
elements in two different partitions P, Q. If we denote by

a: Number of pairs of data objects belonging to the same 
class in P and to the same class in Q. 

b: Number of pairs of data objects belonging to the same 
class in P and to different classes in Q. 

c: Number of pairs of data objects belonging to different 
classes in P and to the same class in Q. 

d: Number of pairs of data objects belonging to different 
classes in P and to different classes in Q. 

we can compute the Rand index R as 

$$ R = \frac{a+d}{a+b+c+d} = \frac{2(a+d)}{n(n-1)}, $$

since $a+b+c+d$ is the total number $n(n-1)/2$
of pairs in the system. This is a measure of the agreement
between partitions, since the terms $a$ and $d$ count
consistent classifications (agreements), whereas the terms $b$
and $c$ count inconsistent classifications (disagreements).
Unfortunately, for a partition composed
of many clusters, the term $d$ dominates, so that the
quantity $R$ can be close to 1 even if the partitions differ
substantially. To overcome this, the adjusted Rand index
$R_a$ has been introduced,

$$ R_a = \frac{a - \dfrac{(a+c)(a+b)}{a+b+c+d}}{\dfrac{2a+b+c}{2} - \dfrac{(a+c)(a+b)}{a+b+c+d}}, $$

which is equal to 0, on average, if the two partitions P and Q are
randomly drawn [31].

Fig. 2: The category size distribution (triangles) compared to
the cluster size distribution (circles) obtained by the MCL
algorithm for the Danish (a), Croatian (b), Galician (c), Simple
English (d), Latin (e) and Neapolitan (f) Wikipedia.

Fig. 3: The overlap between category-based and clustering-based
partitions measured by the Adjusted Rand Index for a
sample of Wikipedia networks. The red solid line represents
the expected value for two random partitions.

It should be noted that articles in Wikipedia can lie
in more than one category. Therefore, the taxonomy has
to be treated as a soft partition, i.e. a partition where
the intersection of classes is not empty and elements can belong to
more than one class with varying intensity. Accordingly,
we adopted the recently introduced generalization of the
adjusted Rand index to fuzzy partitions [29].
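For reference, the pair counts a, b, c, d and the two indices can be computed as in the sketch below. It covers only hard partitions, where each element carries exactly one label, and does not implement the fuzzy generalization of [29] adopted in the analysis. The function names and the label-list encoding of a partition are assumptions.

```python
from itertools import combinations


def pair_counts(P, Q):
    """Count element pairs by agreement; P[i] and Q[i] are the labels of element i."""
    a = b = c = d = 0
    for i, j in combinations(range(len(P)), 2):
        same_p = P[i] == P[j]
        same_q = Q[i] == Q[j]
        if same_p and same_q:
            a += 1      # same class in P, same class in Q
        elif same_p:
            b += 1      # same class in P, different classes in Q
        elif same_q:
            c += 1      # different classes in P, same class in Q
        else:
            d += 1      # different classes in both
    return a, b, c, d


def rand_index(P, Q):
    a, b, c, d = pair_counts(P, Q)
    return (a + d) / (a + b + c + d)


def adjusted_rand_index(P, Q):
    a, b, c, d = pair_counts(P, Q)
    expected = (a + b) * (a + c) / (a + b + c + d)
    # Undefined (division by zero) in degenerate cases, e.g. when both
    # partitions place all elements in a single class.
    return (a - expected) / ((2 * a + b + c) / 2 - expected)


# Two partitions of four elements that disagree on every within-class pair.
P = [0, 0, 1, 1]
Q = [0, 1, 0, 1]
counts = pair_counts(P, Q)          # (0, 2, 2, 2)
r = rand_index(P, Q)                # 2/6
r_adj = adjusted_rand_index(P, Q)   # -0.5
```

In this example the plain Rand index is still 1/3 because of the d term alone, while the adjusted index is negative, i.e. the agreement is below what two random partitions would give.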

The Adjusted Rand Index takes very different values
when measured in different systems. Moreover, its value
seems to be uncorrelated with the network size, as reported
in figure 3, where only articles assigned to at least
one category are taken into account. This shows that the
categorization of Wikipedia articles does not necessarily
correspond to the clustering patterns emerging from the
MCL algorithm. The latter, in fact, could in some cases
result in a quite different organization of knowledge.

From the above analysis we can conclude that the
two divisions of the graphs represent truly different local
and global processes on the network, depending upon the
decentralised users' action and the consensual collective
choice respectively. This difference is not reflected in
the frequency distribution $P(s)$ of the category and
cluster sizes. Rather, both quantities follow the
same scale-invariant distribution $P(s) \propto s^{-2.2}$.
This suggests that the presence of power laws in these
quantities is more related to the fractal nature of the
branching in the category structure [15] when approaching
the problem top-down, or conversely to Zipf's
law [32] when considering the inverse bottom-up process of
cluster formation. The varying agreement between clustering
and categorization across the studied versions of
Wikipedia suggests that links in Wikipedia do not necessarily
imply similarity or relatedness relations. From a
technological point of view, this observation implies that,
before switching to automatic categorization of items in
Wikipedia and in other information networks, one should
test how the selected clustering algorithm performs
with respect to manual indexing.

* * * 